INTRODUCTION

Students’ academic performance is shaped by a myriad of factors—ranging from the hours they study and the support they receive at home, to socioeconomic conditions and school quality. Understanding these relationships can offer insights into how best to improve student outcomes and inform resource allocation. In this project, we analyze a dataset of 6,607 students, each described by 20 variables that capture both academic and personal attributes, with the aim of discovering which factors most strongly correlate with final exam performance.

We begin with exploratory data analysis to understand the distribution of study hours, attendance, previous scores, and other quantitative predictors, as well as the influence of categorical factors such as parental involvement and access to resources. From there, we fit a variety of statistical and machine-learning models—linear regression, generalized additive models (GAMs), Poisson regression, logistic regression, support vector machines (SVMs), and neural networks—to compare their predictive power and interpretability. Our pipeline includes data cleaning steps, handling missing values, and even oversampling techniques (e.g., SMOTE) for imbalanced classes. By systematically exploring both conventional and more advanced approaches, we aim to illuminate key patterns in student achievement, providing a multi-faceted view of how diverse factors—from motivation level to family income—can help or hinder academic success.

DATA LOADING

```r
Students <- read.csv(
  "StudentPerformanceFactors.csv",
  header = TRUE,
  stringsAsFactors = TRUE
)

str(Students)
# head(Students)
```

```
'data.frame':   6607 obs. of  20 variables:
 $ Hours_Studied             : int  23 19 24 29 19 19 29 25 17 23 ...
 $ Attendance                : int  84 64 98 89 92 88 84 78 94 98 ...
 $ Parental_Involvement      : Factor w/ 3 levels "High","Low","Medium": 2 2 3 2 3 3 3 2 3 3 ...
 $ Access_to_Resources       : Factor w/ 3 levels "High","Low","Medium": 1 3 3 3 3 3 2 1 1 3 ...
 $ Extracurricular_Activities: Factor w/ 2 levels "No","Yes": 1 1 2 2 2 2 2 2 1 2 ...
 $ Sleep_Hours               : int  7 8 7 8 6 8 7 6 6 8 ...
 $ Previous_Scores           : int  73 59 91 98 65 89 68 50 80 71 ...
 $ Motivation_Level          : Factor w/ 3 levels "High","Low","Medium": 2 2 3 3 3 3 2 3 1 3 ...
 $ Internet_Access           : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ Tutoring_Sessions         : int  0 2 2 1 3 3 1 1 0 0 ...
 $ Family_Income             : Factor w/ 3 levels "High","Low","Medium": 2 3 3 3 3 3 2 1 3 1 ...
 $ Teacher_Quality           : Factor w/ 4 levels "","High","Low",..: 4 4 4 4 2 4 4 2 3 2 ...
 $ School_Type               : Factor w/ 2 levels "Private","Public": 2 2 2 2 2 2 1 2 1 2 ...
 $ Peer_Influence            : Factor w/ 3 levels "Negative","Neutral",..: 3 1 2 1 2 3 2 1 2 3 ...
 $ Physical_Activity         : int  3 4 4 4 4 3 2 2 1 5 ...
 $ Learning_Disabilities     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ Parental_Education_Level  : Factor w/ 4 levels "","College","High School",..: 3 2 4 3 2 4 3 3 2 3 ...
 $ Distance_from_Home        : Factor w/ 4 levels "","Far","Moderate",..: 4 3 4 3 4 4 3 2 4 3 ...
 $ Gender                    : Factor w/ 2 levels "Female","Male": 2 1 2 2 1 2 2 2 2 2 ...
 $ Exam_Score                : int  67 61 74 71 70 71 67 66 69 72 ...
```

The dataset contains 6607 individual student records and 20 variables that represent both academic and personal attributes. Among these variables, there are integer columns—such as Hours_Studied, Attendance, Sleep_Hours, Previous_Scores, Tutoring_Sessions, Physical_Activity, and Exam_Score—that capture quantitative measures of each student’s study habits and performance (with exam scores likely on a scale from 0 to 100). Additionally, a number of columns are categorical (factors), covering aspects such as Parental_Involvement, Access_to_Resources, Motivation_Level, Internet_Access, and Extracurricular_Activities, which help describe the environment and resources available to students. The dataset also includes demographic or contextual variables such as Family_Income, Teacher_Quality, School_Type, Peer_Influence, Gender, and Distance_from_Home, plus a special factor noting if the student has Learning_Disabilities. Lastly, Parental_Education_Level ranges from High School to Postgraduate (with some missing categories), which further situates a student’s socio-academic background. Together, these variables form a rich dataset for analyzing how diverse factors—ranging from study intensity to family context—relate to final exam performance.

EXPLORATORY DATA ANALYSIS AND BASIC DATA CHECKS

Summary statistics for each column

```r
captured_output <- capture.output(summary(Students))
partial_output <- captured_output[1:4]
cat(partial_output, sep = "\n")
```

```
 Hours_Studied     Attendance     Parental_Involvement Access_to_Resources
 Min.   : 1.00   Min.   : 60.00   High  :1908          High  :1975        
 1st Qu.:16.00   1st Qu.: 70.00   Low   :1337          Low   :1313        
 Median :20.00   Median : 80.00   Medium:3362          Medium:3319        
```

Check for missing values in each column

```r
missing_counts <- colSums(is.na(Students))
captured_output <- capture.output(missing_counts)
partial_output <- captured_output[1:4]
cat(partial_output, sep = "\n")
```

```
             Hours_Studied                 Attendance 
                         0                          0 
      Parental_Involvement        Access_to_Resources 
                         0                          0 
```

The summary table of all 20 variables in the dataset shows both numeric distributions (via minimum, quartiles, median, mean, and maximum) and category counts. For example, Hours_Studied ranges from 1 to 44 hours (median around 20), while Attendance lies between 60% and 100%, with a median of 80%. Categorical variables such as Parental_Involvement, Access_to_Resources, and Family_Income are split into Low, Medium, and High groups, each with its own count of records. Other variables, like Extracurricular_Activities (“Yes” vs. “No”) and Gender (“Male” vs. “Female”), reflect binary or multi-category data. The table also highlights key academic factors (e.g., Sleep_Hours around 4–10, Exam_Score ranging from 55 to 101) and personal attributes (e.g., Motivation_Level, Learning_Disabilities), providing a broad view of each student’s context.

Notably, there are no missing values in any of the variables.

WHY DIFFERENT MODELS?

We began our analysis by fitting linear regression models to understand the fundamental relationships between continuous features (e.g., hours studied, attendance) and exam scores, allowing us to quickly identify which predictors had statistically significant linear effects. Recognizing signs of non-linearity in variables like Previous_Scores, we turned to Generalized Additive Models (GAMs), which provide additional flexibility by modeling non-linear relationships without overly complex transformations. Since some aspects of student performance data could be viewed as count-like outcomes (e.g., number of questions correct), we briefly explored Poisson regression to account for discrete distributions. For binary outcomes, we applied logistic regression, which is well suited to binary classification and easy to interpret. To handle more complex, possibly non-linear decision boundaries in predicting performance categories, we used Support Vector Machines (SVMs), leveraging kernels (radial, linear, etc.) to find optimal separating hyperplanes. Finally, we experimented with neural networks, which can automatically learn intricate, non-linear patterns from multiple features and may capture interactions that linear methods miss. By sequentially introducing these methods, we balanced interpretability with predictive power, tailoring each model choice to the nature of the data and the specific research questions.
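As an illustrative sketch (not the project's actual pipeline), each of these model families shares R's formula interface. The toy data frame and variable names below are assumptions for demonstration only; mgcv and nnet ship with R as recommended packages, while e1071 (for SVMs) must be installed separately.

```r
library(mgcv)  # recommended package shipped with R; provides gam()

# Toy data, purely illustrative of the modeling interfaces.
set.seed(1)
toy <- data.frame(hours = runif(200, 1, 40))
toy$score  <- pmin(100, 40 + 1.2 * toy$hours + rnorm(200, sd = 5))
toy$passed <- factor(ifelse(toy$score >= 60, "Yes", "No"), levels = c("No", "Yes"))

fit_lm  <- lm(score ~ hours, data = toy)                             # linear regression
fit_gam <- gam(score ~ s(hours), data = toy)                         # GAM with a smooth term
fit_poi <- glm(round(score) ~ hours, family = poisson,  data = toy)  # count-style outcome
fit_log <- glm(passed ~ hours,       family = binomial, data = toy)  # binary outcome
# SVMs and neural networks follow the same interface, e.g.
# e1071::svm(passed ~ hours, data = toy) or nnet::nnet(passed ~ hours, data = toy, size = 3)
```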

DATA CLEANUP AND DROPPING UNWANTED ROWS/COLUMNS

In this step, we addressed data inconsistencies and missing values to ensure our analysis would be both accurate and efficient. First, we replaced any exam scores of “101” with “100” to correct what was likely an error or outlier.

```r
Students$Exam_Score[Students$Exam_Score == 101] <- 100
```

Next, we checked for missing and null values, which verified that no cells contain NA. (Note that is.null() is not vectorized: it tests the object as a whole, so the second check simply confirms the data frame itself is not NULL.)

```r
sum(is.na(Students))
```

```
[1] 0
```

```r
sum(is.null(Students))
```

```
[1] 0
```

We then removed rows where the factors for “Teacher_Quality” and “Parental_Education_Level” were empty strings, as those observations lacked meaningful information for our models.

```r
Students <- Students[Students$Teacher_Quality != "" & 
                     Students$Parental_Education_Level != "", ]
```

Finally, we attempted to drop rows with empty strings for “Distance_from_Home” but found that this left us with no data, so instead, we removed the entire column. These cleaning steps helped ensure that our dataset was consistent and complete before proceeding with more advanced analyses.

```r
Students <- Students %>%
  select(-Distance_from_Home)
```

EXPLORATORY VISUALS: EFFECT OF CATEGORICAL VARIABLES ON EXAM_SCORE

In this section, we explored how various categorical factors—such as parental involvement, access to resources, motivation level, family income, and others—relate to students’ exam scores. By creating boxplots for each categorical variable, we could quickly discern how the distribution of exam scores differs across groups (e.g., “Low,” “Medium,” and “High” for certain factors). The code loops through the list of these categorical variables and generates individual boxplots, each highlighting potential differences in mean scores, overall score spread, and outliers. Finally, we arranged all of these plots in a grid layout to facilitate easy comparisons between different predictors. This visual analysis serves as a preliminary step to identify which categories might have the strongest relationships with student performance.


```r
# Define a vector of categorical columns
categorical_cols <- c("Parental_Involvement", "Access_to_Resources", 
                      "Motivation_Level", "Family_Income", "Teacher_Quality", 
                      "School_Type", "Peer_Influence", "Learning_Disabilities", 
                      "Parental_Education_Level", "Gender")

plots <- list()

# Loop through each categorical variable and create a ggplot boxplot
for (cat_var in categorical_cols) {
  p <- ggplot(Students, aes(x = .data[[cat_var]], y = Exam_Score)) +
    geom_boxplot(fill = "lightblue") +
    labs(title = paste("Exam Score by", cat_var),
         x = cat_var,
         y = "Exam Score") +
    theme_minimal() +
    theme(axis.text.x = element_text(hjust = 1))  # Rotate x-axis labels for readability
  plots[[cat_var]] <- p
}

# Adjust the number of columns depending on the screen space available
do.call(grid.arrange, c(plots, ncol = 3))
```


Higher categories of involvement or resources often showed elevated median scores, while students with learning disabilities registered slightly lower medians (though there are notable exceptions). Interestingly, school type (public vs. private) did not exhibit a dramatic effect, and the distributions for female and male students overlapped substantially, indicating a relatively minor difference in performance by gender. Ultimately, these figures highlight the importance of home environment, peer support, and motivation in shaping student achievement.

LINEAR MODELS

We began with a simple model containing only an intercept and progressively introduced categorical and continuous predictors (e.g., Parental_Involvement, Hours_Studied). At each step, we checked for statistical significance (using anova() and drop1()) and collinearity (using vif()), removing non-significant variables (such as Gender, School_Type, and Sleep_Hours). We also explored potential interactions (e.g., Gender * Hours_Studied), retaining them only if they improved the model. This stepwise approach helped refine the linear regression model to identify key predictors of students’ exam scores while maintaining good model fit and low collinearity.


```r
# We fit a model that does not take into account any predictor
Exam_Scores.0 <- lm(Exam_Score ~ 1, data = Students)
coef(Exam_Scores.0)
```

```
(Intercept) 
   67.24461 
```

```r
# Testing several categorical variables
Exam_Scores.1 <- lm(Exam_Score ~ Parental_Involvement, data = Students)
coef(Exam_Scores.1)
```

```
               (Intercept)    Parental_InvolvementLow 
                68.1004836                 -1.7251958 
Parental_InvolvementMedium 
                -0.9961835 
```

```r
summary(Exam_Scores.1)
```

```

Call:
lm(formula = Exam_Score ~ Parental_Involvement, data = Students)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.104  -2.104  -0.104   1.900  33.625 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                68.10048    0.08946 761.223   <2e-16 ***
Parental_InvolvementLow    -1.72520    0.13941 -12.375   <2e-16 ***
Parental_InvolvementMedium -0.99618    0.11201  -8.894   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.859 on 6440 degrees of freedom
Multiple R-squared:  0.02453,   Adjusted R-squared:  0.02423 
F-statistic: 80.99 on 2 and 6440 DF,  p-value: < 2.2e-16
```

Looking at the p-values, we see strong evidence that the mean Exam_Score for the Low and Medium Parental_Involvement groups differs from that of the High group, which serves as the reference level.
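The coefficient names follow R's default treatment coding: factor levels are sorted alphabetically, so “High” becomes the reference level absorbed into the intercept. A minimal, self-contained illustration:

```r
# Illustrative only: default treatment contrasts for a 3-level factor.
involvement <- factor(c("High", "Low", "Medium"))
levels(involvement)     # "High" "Low" "Medium" -- "High" is the reference
contrasts(involvement)
#        Low Medium
# High     0      0
# Low      1      0
# Medium   0      1
# Hence Parental_InvolvementLow and ...Medium are offsets from the "High" mean.
```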

We then compared the two nested models with an F-test via anova().

```r
anova(Exam_Scores.0, Exam_Scores.1)
```

```
Analysis of Variance Table

Model 1: Exam_Score ~ 1
Model 2: Exam_Score ~ Parental_Involvement
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1   6442 98333                                  
2   6440 95920  2    2412.5 80.986 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We have strong evidence that including Parental involvement can help to better represent the data.

We then added other predictors to the model, and performed collinearity checks.

```r
Exam_Scores.2 <- update(Exam_Scores.1, 
  . ~ . + Access_to_Resources + Extracurricular_Activities +
    Motivation_Level + Internet_Access + Family_Income + Teacher_Quality + 
    School_Type + Peer_Influence + Learning_Disabilities + 
    Parental_Education_Level + Gender)

# collinearity check 
# vif(Exam_Scores.2)
```

The Generalized Variance Inflation Factors (GVIFs) were all very close to 1 (see Appendix), which indicated that there is no serious collinearity issue in the model.
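For intuition, a variance inflation factor can be computed by hand as 1/(1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors. A small base-R sketch on synthetic data (not the project's model):

```r
# Hand-computed VIF on synthetic data; a value near 1 means the predictor
# is almost uncorrelated with the others, as observed for this dataset.
set.seed(42)
x1 <- rnorm(500)
x2 <- rnorm(500)              # generated independently of x1, so VIF should be near 1
y  <- 2 * x1 - x2 + rnorm(500)

r2_x1  <- summary(lm(x1 ~ x2))$r.squared  # R^2 of x1 regressed on the other predictor
vif_x1 <- 1 / (1 - r2_x1)
vif_x1                                    # close to 1
```

With more than two predictors, x1 would be regressed on all the others; car::vif() automates this (and reports GVIFs for multi-level factors).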

To check whether it made sense to keep the categorical variables, we used the drop1() function.

```r
drop1(Exam_Scores.2, test = "F")
```

```
Single term deletions

Model:
Exam_Score ~ Parental_Involvement + Access_to_Resources + Extracurricular_Activities + 
    Motivation_Level + Internet_Access + Family_Income + Teacher_Quality + 
    School_Type + Peer_Influence + Learning_Disabilities + Parental_Education_Level + 
    Gender
                           Df Sum of Sq   RSS   AIC  F value    Pr(>F)    
<none>                                  87272 16831                       
Parental_Involvement        2   2553.16 89825 17012  93.9531 < 2.2e-16 ***
Access_to_Resources         2   3010.44 90282 17045 110.7807 < 2.2e-16 ***
Extracurricular_Activities  1    427.63 87700 16860  31.4724 2.107e-08 ***
Motivation_Level            2    772.86 88045 16884  28.4404 5.045e-13 ***
Internet_Access             1    263.83 87536 16848  19.4175 1.067e-05 ***
Family_Income               2    853.78 88126 16889  31.4182 2.640e-14 ***
Teacher_Quality             2    604.80 87877 16871  22.2560 2.332e-10 ***
School_Type                 1      3.07 87275 16829   0.2257    0.6348    
Peer_Influence              2    895.10 88167 16892  32.9385 5.859e-15 ***
Learning_Disabilities       1    680.82 87953 16879  50.1067 1.611e-12 ***
Parental_Education_Level    2   1099.26 88371 16907  40.4513 < 2.2e-16 ***
Gender                      1      4.39 87276 16829   0.3230    0.5699    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

There is strong evidence that all the categorical variables except for Gender and School type play a role in impacting the Exam_Score.

We then added the continuous variables by updating the previous model, and checked again for significance and collinearity.


```r
Exam_Scores.3 <- update(Exam_Scores.2,
  . ~ . + Hours_Studied + Attendance +
    Sleep_Hours + Previous_Scores + Tutoring_Sessions +
    Physical_Activity)

drop1(Exam_Scores.3, test = "F")
```

```
Single term deletions

Model:
Exam_Score ~ Parental_Involvement + Access_to_Resources + Extracurricular_Activities + 
    Motivation_Level + Internet_Access + Family_Income + Teacher_Quality + 
    School_Type + Peer_Influence + Learning_Disabilities + Parental_Education_Level + 
    Gender + Hours_Studied + Attendance + Sleep_Hours + Previous_Scores + 
    Tutoring_Sessions + Physical_Activity
                           Df Sum of Sq   RSS     AIC   F value    Pr(>F)    
<none>                                  27813  9475.0                        
Parental_Involvement        2      3095 30908 10150.8  357.0442 < 2.2e-16 ***
Access_to_Resources         2      3308 31121 10195.0  381.5767 < 2.2e-16 ***
Extracurricular_Activities  1       469 28282  9580.6  108.1030 < 2.2e-16 ***
Motivation_Level            2       869 28682  9669.2  100.2518 < 2.2e-16 ***
Internet_Access             1       380 28194  9560.5   87.7770 < 2.2e-16 ***
Family_Income               2      1034 28847  9706.2  119.2723 < 2.2e-16 ***
Teacher_Quality             2       651 28464  9620.0   75.0610 < 2.2e-16 ***
School_Type                 1         1 27814  9473.2    0.2382    0.6255    
Peer_Influence              2       951 28764  9687.6  109.6678 < 2.2e-16 ***
Learning_Disabilities       1       438 28251  9573.7  101.0631 < 2.2e-16 ***
Parental_Education_Level    2       930 28743  9683.0  107.3019 < 2.2e-16 ***
Gender                      1         3 27816  9473.7    0.7056    0.4009    
Hours_Studied               1     19865 47679 12945.6 4583.2846 < 2.2e-16 ***
Attendance                  1     33966 61780 14614.9 7836.6059 < 2.2e-16 ***
Sleep_Hours                 1         0 27813  9473.0    0.0063    0.9368    
Previous_Scores             1      3200 31014 10174.7  738.3529 < 2.2e-16 ***
Tutoring_Sessions           1      2462 30276 10019.6  568.1318 < 2.2e-16 ***
Physical_Activity           1       245 28058  9529.5   56.5190 6.326e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```r
vif(Exam_Scores.3)
```

```
                               GVIF Df GVIF^(1/(2*Df))
Parental_Involvement       1.008099  2        1.002019
Access_to_Resources        1.009556  2        1.002381
Extracurricular_Activities 1.003310  1        1.001653
Motivation_Level           1.007831  2        1.001952
Internet_Access            1.003360  1        1.001679
Family_Income              1.007315  2        1.001824
Teacher_Quality            1.007168  2        1.001787
School_Type                1.003959  1        1.001978
Peer_Influence             1.008180  2        1.002039
Learning_Disabilities      1.002891  1        1.001445
Parental_Education_Level   1.007408  2        1.001847
Gender                     1.002369  1        1.001184
Hours_Studied              1.002553  1        1.001275
Attendance                 1.005263  1        1.002628
Sleep_Hours                1.002973  1        1.001485
Previous_Scores            1.006470  1        1.003230
Tutoring_Sessions          1.001716  1        1.000857
Physical_Activity          1.007046  1        1.003517
```

There is strong evidence that all predictors except Sleep_Hours, Gender, and School_Type play a role, so we drop these three variables. The collinearity check confirmed that all GVIFs are well below 5, meaning there is no serious collinearity issue in the model.

We then tested whether adding interactions between certain variables (Gender and Hours_Studied, Gender and Peer_Influence, Motivation_Level and Hours_Studied) significantly improved the linear model predicting Exam_Score.


```r
Exam_Scores.3.1 <- update(Exam_Scores.3,
                          . ~ . + Gender*Hours_Studied)

Exam_Scores.3.2 <- update(Exam_Scores.3,
                          . ~ . + Gender*Peer_Influence)

summary(Exam_Scores.3.2)
```

```

Call:
lm(formula = Exam_Score ~ Parental_Involvement + Access_to_Resources + 
    Extracurricular_Activities + Motivation_Level + Internet_Access + 
    Family_Income + Teacher_Quality + School_Type + Peer_Influence + 
    Learning_Disabilities + Parental_Education_Level + Gender + 
    Hours_Studied + Attendance + Sleep_Hours + Previous_Scores + 
    Tutoring_Sessions + Physical_Activity + Peer_Influence:Gender, 
    data = Students)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8710 -0.4827 -0.1453  0.1805 30.1921 

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          42.456303   0.335367 126.596  < 2e-16 ***
Parental_InvolvementLow              -1.983898   0.075428 -26.302  < 2e-16 ***
Parental_InvolvementMedium           -1.056796   0.060534 -17.458  < 2e-16 ***
Access_to_ResourcesLow               -2.061120   0.075162 -27.422  < 2e-16 ***
Access_to_ResourcesMedium            -0.992281   0.060103 -16.510  < 2e-16 ***
Extracurricular_ActivitiesYes         0.551374   0.052988  10.406  < 2e-16 ***
Motivation_LevelLow                  -1.060903   0.075467 -14.058  < 2e-16 ***
Motivation_LevelMedium               -0.538835   0.068630  -7.851 4.79e-15 ***
Internet_AccessYes                    0.919685   0.098119   9.373  < 2e-16 ***
Family_IncomeLow                     -1.092389   0.071933 -15.186  < 2e-16 ***
Family_IncomeMedium                  -0.590801   0.071992  -8.206 2.73e-16 ***
Teacher_QualityLow                   -1.043410   0.094650 -11.024  < 2e-16 ***
Teacher_QualityMedium                -0.548117   0.058192  -9.419  < 2e-16 ***
School_TypePublic                     0.027102   0.056437   0.480    0.631    
Peer_InfluenceNeutral                 0.505869   0.107432   4.709 2.54e-06 ***
Peer_InfluencePositive                0.972776   0.106570   9.128  < 2e-16 ***
Learning_DisabilitiesYes             -0.851185   0.084782 -10.040  < 2e-16 ***
Parental_Education_LevelHigh School  -0.477774   0.059883  -7.979 1.74e-15 ***
Parental_Education_LevelPostgraduate  0.500655   0.074759   6.697 2.31e-11 ***
GenderMale                           -0.091717   0.114567  -0.801    0.423    
Hours_Studied                         0.293927   0.004342  67.688  < 2e-16 ***
Attendance                            0.199287   0.002252  88.501  < 2e-16 ***
Sleep_Hours                          -0.001497   0.017701  -0.085    0.933    
Previous_Scores                       0.049095   0.001807  27.175  < 2e-16 ***
Tutoring_Sessions                     0.501630   0.021051  23.829  < 2e-16 ***
Physical_Activity                     0.189712   0.025282   7.504 7.03e-14 ***
Peer_InfluenceNeutral:GenderMale      0.035610   0.142257   0.250    0.802    
Peer_InfluencePositive:GenderMale     0.084271   0.141646   0.595    0.552    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.082 on 6415 degrees of freedom
Multiple R-squared:  0.7172,    Adjusted R-squared:  0.716 
F-statistic: 602.5 on 27 and 6415 DF,  p-value: < 2.2e-16
```

Both interactions yielded very small effect estimates and high p-values, indicating no statistically significant difference in how peer influence relates to exam scores by gender.

In addition to the interaction terms, variables such as School_TypePublic, Sleep_Hours, and GenderMale also showed high p-values, suggesting that these factors are not statistically significant predictors of exam scores in this dataset. We therefore removed them from our dataset.

```r
Students1 <- dplyr::select(Students, -School_Type, -Gender, -Sleep_Hours)
```

We then updated our model accordingly, refitting it with the remaining variables, and performed another collinearity check.


```r
Exam_Scores.4 <- lm(Exam_Score ~ Parental_Involvement+ Access_to_Resources + Extracurricular_Activities +
                      Motivation_Level + Internet_Access + Family_Income + Teacher_Quality + 
                      Peer_Influence + Learning_Disabilities + Parental_Education_Level + 
                      Hours_Studied + Attendance + Previous_Scores + Tutoring_Sessions +
                      Physical_Activity, data = Students1)

vif(Exam_Scores.4)
```

```
                               GVIF Df GVIF^(1/(2*Df))
Parental_Involvement       1.006995  2        1.001744
Access_to_Resources        1.007852  2        1.001957
Extracurricular_Activities 1.003226  1        1.001612
Motivation_Level           1.007434  2        1.001853
Internet_Access            1.002890  1        1.001444
Family_Income              1.006778  2        1.001690
Teacher_Quality            1.006975  2        1.001739
Peer_Influence             1.007183  2        1.001791
Learning_Disabilities      1.002300  1        1.001149
Parental_Education_Level   1.006635  2        1.001655
Hours_Studied              1.002292  1        1.001145
Attendance                 1.004467  1        1.002231
Previous_Scores            1.005712  1        1.002852
Tutoring_Sessions          1.001498  1        1.000749
Physical_Activity          1.006863  1        1.003425
```

All values are extremely close to 1, indicating that none of the predictors exhibits problematic levels of collinearity.

We explored potential interaction effects by testing a model that incorporates an interaction term between motivation level and hours studied.


```r
Exam_Scores.5 <- update(Exam_Scores.4,
                        . ~ . + Motivation_Level*Hours_Studied)
summary(Exam_Scores.5)
```

```

Call:
lm(formula = Exam_Score ~ Parental_Involvement + Access_to_Resources + 
    Extracurricular_Activities + Motivation_Level + Internet_Access + 
    Family_Income + Teacher_Quality + Peer_Influence + Learning_Disabilities + 
    Parental_Education_Level + Hours_Studied + Attendance + Previous_Scores + 
    Tutoring_Sessions + Physical_Activity + Motivation_Level:Hours_Studied, 
    data = Students1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8431 -0.4825 -0.1456  0.1878 30.2002 

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          42.727732   0.345107 123.810  < 2e-16 ***
Parental_InvolvementLow              -1.983279   0.075371 -26.314  < 2e-16 ***
Parental_InvolvementMedium           -1.054724   0.060504 -17.432  < 2e-16 ***
Access_to_ResourcesLow               -2.058117   0.075101 -27.405  < 2e-16 ***
Access_to_ResourcesMedium            -0.990247   0.060034 -16.495  < 2e-16 ***
Extracurricular_ActivitiesYes         0.549189   0.052949  10.372  < 2e-16 ***
Motivation_LevelLow                  -1.625001   0.262704  -6.186 6.56e-10 ***
Motivation_LevelMedium               -0.834034   0.237051  -3.518 0.000437 ***
Internet_AccessYes                    0.917142   0.098040   9.355  < 2e-16 ***
Family_IncomeLow                     -1.091414   0.071872 -15.186  < 2e-16 ***
Family_IncomeMedium                  -0.587659   0.071934  -8.169 3.70e-16 ***
Teacher_QualityLow                   -1.046080   0.094609 -11.057  < 2e-16 ***
Teacher_QualityMedium                -0.547117   0.058148  -9.409  < 2e-16 ***
Peer_InfluenceNeutral                 0.524074   0.070390   7.445 1.09e-13 ***
Peer_InfluencePositive                1.019012   0.070083  14.540  < 2e-16 ***
Learning_DisabilitiesYes             -0.850683   0.084700 -10.044  < 2e-16 ***
Parental_Education_LevelHigh School  -0.475996   0.059829  -7.956 2.09e-15 ***
Parental_Education_LevelPostgraduate  0.500784   0.074685   6.705 2.18e-11 ***
Hours_Studied                         0.278114   0.009712  28.635  < 2e-16 ***
Attendance                            0.199256   0.002250  88.574  < 2e-16 ***
Previous_Scores                       0.049057   0.001805  27.177  < 2e-16 ***
Tutoring_Sessions                     0.502376   0.021035  23.883  < 2e-16 ***
Physical_Activity                     0.189691   0.025265   7.508 6.82e-14 ***
Motivation_LevelLow:Hours_Studied     0.028466   0.012679   2.245 0.024796 *  
Motivation_LevelMedium:Hours_Studied  0.014972   0.011437   1.309 0.190536    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.081 on 6418 degrees of freedom
Multiple R-squared:  0.7173,    Adjusted R-squared:  0.7163 
F-statistic: 678.6 on 24 and 6418 DF,  p-value: < 2.2e-16
```

The regression model reveals that exam scores are significantly influenced by various factors, including parental involvement, access to resources, motivation level, family income, teacher quality, and hours studied. Notably, the interaction term between motivation level and hours studied indicates that students with low motivation benefit differently from studying compared to those with higher motivation. The model explains 71.7% of the variance in exam scores, showing strong predictive power. Key predictors like hours studied and previous scores positively impact performance, while challenges such as low family income and limited resources negatively affect outcomes. This underscores the importance of addressing both personal and environmental factors to improve student achievement.
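To make the interaction concrete: under a specification of this form, the Hours_Studied slope for a given motivation group is the main-effect slope plus that group's interaction coefficient. A hedged toy example with simulated data and illustrative names (not the project's fitted model):

```r
# Toy illustration of reading an interaction slope: for the "Low" group the
# hours slope is the main effect plus the interaction coefficient.
set.seed(7)
n <- 300
motivation <- factor(sample(c("High", "Low"), n, replace = TRUE))
hours <- runif(n, 1, 40)
score <- 50 + 0.30 * hours + ifelse(motivation == "Low", 0.03, 0) * hours +
  rnorm(n, sd = 2)

fit <- lm(score ~ motivation * hours)
slope_high <- coef(fit)["hours"]
slope_low  <- coef(fit)["hours"] + coef(fit)["motivationLow:hours"]
c(High = unname(slope_high), Low = unname(slope_low))  # roughly 0.30 and 0.33
```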

The drop1 function was used to assess the impact of dropping each term from the model by performing an F-test and comparing the resulting models.


```r
drop1(Exam_Scores.5, test = "F")
```

```
Single term deletions

Model:
Exam_Score ~ Parental_Involvement + Access_to_Resources + Extracurricular_Activities + 
    Motivation_Level + Internet_Access + Family_Income + Teacher_Quality + 
    Peer_Influence + Learning_Disabilities + Parental_Education_Level + 
    Hours_Studied + Attendance + Previous_Scores + Tutoring_Sessions + 
    Physical_Activity + Motivation_Level:Hours_Studied
                               Df Sum of Sq   RSS     AIC   F value    Pr(>F)    
<none>                                27795  9468.9                              
Parental_Involvement            2      3090 30885 10144.0  356.7107 < 2.2e-16 ***
Access_to_Resources             2      3297 31092 10187.1  380.6289 < 2.2e-16 ***
Extracurricular_Activities      1       466 28261  9574.0  107.5797 < 2.2e-16 ***
Internet_Access                 1       379 28174  9554.1   87.5111 < 2.2e-16 ***
Family_Income                   2      1034 28830  9700.2  119.3931 < 2.2e-16 ***
Teacher_Quality                 2       653 28449  9614.5   75.4029 < 2.2e-16 ***
Peer_Influence                  2       948 28743  9680.9  109.4287 < 2.2e-16 ***
Learning_Disabilities           1       437 28232  9567.3  100.8719 < 2.2e-16 ***
Parental_Education_Level        2       925 28721  9675.9  106.8465 < 2.2e-16 ***
Attendance                      1     33977 61772 14612.1 7845.2881 < 2.2e-16 ***
Previous_Scores                 1      3199 30994 10168.7  738.6079 < 2.2e-16 ***
Tutoring_Sessions               1      2470 30266 10015.4  570.3809 < 2.2e-16 ***
Physical_Activity               1       244 28040  9523.2   56.3713 6.815e-14 ***
Motivation_Level:Hours_Studied  2        22 27817  9470.0    2.5401   0.07894 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The output shows that most predictors make highly significant contributions to the model. For the interaction between motivation level and hours studied, however, there is only very weak evidence of an improvement (p = 0.079), so we did not retain it.
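An F-table of this form is what `drop1()` produces: each row tests whether a single term can be deleted from the full model. A minimal sketch on synthetic data (the variable names and coefficients are illustrative, not from the student dataset):

```r
# Sketch: single-term deletion F-tests, as in the table above.
set.seed(1)
n          <- 200
hours      <- rpois(n, 20)
attendance <- runif(n, 60, 100)
score      <- 40 + 0.8 * hours + 0.3 * attendance + rnorm(n, sd = 2)

fit <- lm(score ~ hours + attendance)
drop1(fit, test = "F")  # one F-test per droppable term, plus a <none> row
```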

CHECK FOR NON-LINEARITY (QUADRATIC, GAM)

We then examined the relationship between previous scores and exam scores:

```r
gg.students <- ggplot(data = Students1,
                      mapping = aes(y = Exam_Score,
                                    x = Previous_Scores)) +
  geom_point()
gg.students + geom_smooth()
```

```
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
```

A close look revealed that the relationship might not be fully linear, so we modeled the curvature shown by the smoother with a quadratic effect and tested the quadratic term with an F-test.

```r
Exam_Scores.6 <- update(Exam_Scores.4, . ~ . + I(Previous_Scores^2))
anova(Exam_Scores.4, Exam_Scores.6)
```

```
Analysis of Variance Table

Model 1: Exam_Score ~ Parental_Involvement + Access_to_Resources + Extracurricular_Activities + 
    Motivation_Level + Internet_Access + Family_Income + Teacher_Quality + 
    Peer_Influence + Learning_Disabilities + Parental_Education_Level + 
    Hours_Studied + Attendance + Previous_Scores + Tutoring_Sessions + 
    Physical_Activity
Model 2: Exam_Score ~ Parental_Involvement + Access_to_Resources + Extracurricular_Activities + 
    Motivation_Level + Internet_Access + Family_Income + Teacher_Quality + 
    Peer_Influence + Learning_Disabilities + Parental_Education_Level + 
    Hours_Studied + Attendance + Previous_Scores + Tutoring_Sessions + 
    Physical_Activity + I(Previous_Scores^2)
  Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
1   6420 27817                              
2   6419 27793  1    24.468 5.6511 0.01747 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

There is some evidence (p = 0.017) that Previous_Scores needs a quadratic term, so we updated our model accordingly.


```r
Exam_Scores.6 <- lm(Exam_Score ~ Parental_Involvement + Access_to_Resources + Extracurricular_Activities + 
                      Motivation_Level + Internet_Access + Family_Income + Teacher_Quality + 
                      Peer_Influence + Learning_Disabilities + Parental_Education_Level + 
                      Hours_Studied + Attendance + Tutoring_Sessions + 
                      Physical_Activity +
                      poly(Previous_Scores, degree = 2),  # spans the linear and quadratic terms
                    data = Students)


Exam_Scores.7 <- gam(Exam_Score~ Parental_Involvement + Access_to_Resources + Extracurricular_Activities + 
                       Motivation_Level + Internet_Access + Family_Income + Teacher_Quality + s(Hours_Studied) +
                       Peer_Influence + Learning_Disabilities + Parental_Education_Level + 
                       s(Attendance) + s(Previous_Scores) + Tutoring_Sessions + 
                       Physical_Activity,
                     data = Students)

summary(Exam_Scores.7)
```

```

Family: gaussian 
Link function: identity 

Formula:
Exam_Score ~ Parental_Involvement + Access_to_Resources + Extracurricular_Activities + 
    Motivation_Level + Internet_Access + Family_Income + Teacher_Quality + 
    s(Hours_Studied) + Peer_Influence + Learning_Disabilities + 
    Parental_Education_Level + +s(Attendance) + s(Previous_Scores) + 
    Tutoring_Sessions + Physical_Activity

Parametric coefficients:
                                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          67.91360    0.17469 388.766  < 2e-16 ***
Parental_InvolvementLow              -1.98174    0.07534 -26.305  < 2e-16 ***
Parental_InvolvementMedium           -1.05872    0.06043 -17.521  < 2e-16 ***
Access_to_ResourcesLow               -2.05717    0.07507 -27.402  < 2e-16 ***
Access_to_ResourcesMedium            -0.99366    0.05999 -16.563  < 2e-16 ***
Extracurricular_ActivitiesYes         0.54468    0.05293  10.290  < 2e-16 ***
Motivation_LevelLow                  -1.06099    0.07535 -14.080  < 2e-16 ***
Motivation_LevelMedium               -0.53862    0.06852  -7.861 4.44e-15 ***
Internet_AccessYes                    0.92309    0.09796   9.423  < 2e-16 ***
Family_IncomeLow                     -1.09784    0.07187 -15.275  < 2e-16 ***
Family_IncomeMedium                  -0.58928    0.07190  -8.195 2.99e-16 ***
Teacher_QualityLow                   -1.04919    0.09450 -11.103  < 2e-16 ***
Teacher_QualityMedium                -0.54929    0.05813  -9.449  < 2e-16 ***
Peer_InfluenceNeutral                 0.52100    0.07036   7.404 1.49e-13 ***
Peer_InfluencePositive                1.02018    0.07001  14.571  < 2e-16 ***
Learning_DisabilitiesYes             -0.84501    0.08463  -9.985  < 2e-16 ***
Parental_Education_LevelHigh School  -0.47980    0.05978  -8.026 1.19e-15 ***
Parental_Education_LevelPostgraduate  0.49669    0.07463   6.655 3.07e-11 ***
Tutoring_Sessions                     0.50003    0.02102  23.790  < 2e-16 ***
Physical_Activity                     0.19186    0.02526   7.595 3.51e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                     edf Ref.df       F p-value    
s(Hours_Studied)   3.057  3.870 1187.73  <2e-16 ***
s(Attendance)      1.000  1.000 7857.48  <2e-16 ***
s(Previous_Scores) 7.501  8.443   89.82  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.717   Deviance explained = 71.8%
GCV = 4.3417  Scale est. = 4.3204    n = 6443
```

```r
plot(Exam_Scores.7, residuals = TRUE, select = 1)
```


Furthermore, we fitted a GAM, and the estimated degrees of freedom (edf) of the smooths indicated a strongly non-linear effect of Previous_Scores (edf ≈ 7.5), a mildly non-linear effect of Hours_Studied (edf ≈ 3.1), and an essentially linear relationship for Attendance (edf = 1), confirming the earlier check.
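How edf reflects curvature can be seen on synthetic data (illustrative, not the student dataset): a truly linear effect is shrunk back toward edf ≈ 1, while a curved effect keeps a larger edf.

```r
library(mgcv)  # gam() comes from mgcv, which ships with standard R installs

set.seed(42)
n  <- 500
x1 <- runif(n)   # enters the response linearly
x2 <- runif(n)   # enters non-linearly
y  <- 2 * x1 + sin(2 * pi * x2) + rnorm(n, sd = 0.3)

fit <- gam(y ~ s(x1) + s(x2))

# edf near 1 => effectively linear; larger edf => wigglier effect
summary(fit)$edf
```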

POISSON REGRESSION FOR COUNT DATA

For count-type data, we also used a Poisson model to predict exam scores. A Poisson GLM is more suitable than a linear model for counts because it guarantees non-negative predictions, ties the variance to the mean, and simulates integer-valued scores:


```r
Exam_Scores.8.1 <- glm(Exam_Score ~ Parental_Involvement + Access_to_Resources + 
                         Motivation_Level + Family_Income + Teacher_Quality + Hours_Studied +
                         Peer_Influence + Parental_Education_Level + 
                         Attendance + Previous_Scores + Tutoring_Sessions + 
                         Physical_Activity,
                       family = "poisson",
                       data = Students)
summary(Exam_Scores.8.1)
```

```

Call:
glm(formula = Exam_Score ~ Parental_Involvement + Access_to_Resources + 
    Motivation_Level + Family_Income + Teacher_Quality + Hours_Studied + 
    Peer_Influence + Parental_Education_Level + +Attendance + 
    Previous_Scores + Tutoring_Sessions + Physical_Activity, 
    family = "poisson", data = Students)

Coefficients:
                                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)                           3.8539287  0.0165835 232.395  < 2e-16 ***
Parental_InvolvementLow              -0.0293888  0.0044198  -6.649 2.94e-11 ***
Parental_InvolvementMedium           -0.0153534  0.0035294  -4.350 1.36e-05 ***
Access_to_ResourcesLow               -0.0302961  0.0044080  -6.873 6.29e-12 ***
Access_to_ResourcesMedium            -0.0146861  0.0035046  -4.191 2.78e-05 ***
Motivation_LevelLow                  -0.0161690  0.0044144  -3.663 0.000250 ***
Motivation_LevelMedium               -0.0079402  0.0040066  -1.982 0.047507 *  
Family_IncomeLow                     -0.0160175  0.0042021  -3.812 0.000138 ***
Family_IncomeMedium                  -0.0084681  0.0042000  -2.016 0.043776 *  
Teacher_QualityLow                   -0.0157905  0.0055508  -2.845 0.004445 ** 
Teacher_QualityMedium                -0.0081776  0.0034007  -2.405 0.016185 *  
Hours_Studied                         0.0043737  0.0002540  17.216  < 2e-16 ***
Peer_InfluenceNeutral                 0.0076275  0.0041378   1.843 0.065273 .  
Peer_InfluencePositive                0.0152673  0.0041150   3.710 0.000207 ***
Parental_Education_LevelHigh School  -0.0072527  0.0035060  -2.069 0.038578 *  
Parental_Education_LevelPostgraduate  0.0074753  0.0043593   1.715 0.086384 .  
Attendance                            0.0029650  0.0001318  22.490  < 2e-16 ***
Previous_Scores                       0.0007299  0.0001058   6.901 5.18e-12 ***
Tutoring_Sessions                     0.0074266  0.0012273   6.051 1.44e-09 ***
Physical_Activity                     0.0027335  0.0014793   1.848 0.064629 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1424.81  on 6442  degrees of freedom
Residual deviance:  395.45  on 6423  degrees of freedom
AIC: 39397

Number of Fisher Scoring iterations: 3
```
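The mean–variance identity that makes Poisson a natural choice here can be demonstrated with a quick simulation (illustrative, not drawn from the fitted model):

```r
# For a Poisson variable the variance equals the mean, so predicted
# variability automatically scales with the predicted score level.
set.seed(1)
draws <- rpois(1e5, lambda = 67)  # 67 is a plausible mean exam score
c(mean = mean(draws), var = var(draws))  # both close to 67
```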

Significant predictors include Parental_InvolvementLow, Access_to_ResourcesLow, and Hours_Studied, among others, with p-values indicating strong evidence against the null hypothesis; for example, Hours_Studied has a significantly positive effect on exam scores (p < 0.001). The intercept is the log of the expected exam score when all predictors are at their reference levels (or zero). The drop from the null deviance (1424.81) to the residual deviance (395.45) shows that the predictors explain most of the deviance, and the AIC (39397) provides a basis for comparing model parsimony.
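Because the model works on the log scale, the coefficients above are easiest to read after exponentiation, as multiplicative rate ratios (using the estimates printed in the summary):

```r
# Multiplicative effect of one extra hour studied on the expected score
exp(0.0043737)   # ~1.0044, i.e. about +0.44% per additional hour

# Effect of low parental involvement vs. the reference level
exp(-0.0293888)  # ~0.971, i.e. about a 2.9% lower expected score
```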


```r
set.seed(2)
sim.data.Students.Poisson <- simulate(Exam_Scores.8.1)
NROW(sim.data.Students.Poisson)
```

```
[1] 6443
```

```r
#head(sim.data.Students.Poisson)
#tail(sim.data.Students.Poisson)

min(Students$Exam_Score)
```

```
[1] 55
```

```r
max(Students$Exam_Score)
```

```
[1] 100
```

Using the simulate() function in R, we generated exam scores from the fitted Poisson model to mimic the observed data. As expected under the Poisson assumption, the simulated scores are all non-negative integers. The simulation returns 6,443 rows, the number of complete cases used to fit the model (the raw dataset has 6,607 students). For reference, the observed scores range from 55 to 100, and the simulated scores are consistent with the Poisson assumption and the model's predictions.
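The end-to-end simulate() workflow can be sketched on synthetic data (the variable names and coefficients below are illustrative, not the fitted model's):

```r
# Sketch: simulate() draws one Poisson response per observation
# from a fitted glm's predicted means.
set.seed(2)
n     <- 1000
hours <- rpois(n, 20)                        # synthetic study hours
score <- rpois(n, exp(3.8 + 0.004 * hours))  # synthetic "exam scores"

fit  <- glm(score ~ hours, family = "poisson")
sims <- simulate(fit)    # data frame with one column, sim_1

nrow(sims)               # matches the number of rows used in fitting
range(sims$sim_1)        # non-negative integers, like real exam scores
```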

LOGISTIC (BINOMIAL) REGRESSION

We first normalized the Exam_Score variable by dividing its values by 100, converting it from a percentage scale (e.g., 0 to 100) to a decimal scale (e.g., 0.0 to 1.0).

Students <- Students %>%
  mutate(Exam_Score = Exam_Score / 100)

Then, we created a binary variable Result, categorizing scores below 0.60 as “fail” (0) and scores 0.60 or above as “pass” (1).

Students <- Students %>%
  mutate(Result = if_else(Exam_Score < 0.60, 0, 1))
unique(Students$Result)
[1] 1 0
summary(Students$Exam_Score)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.5500  0.6500  0.6700  0.6724  0.6900  1.0000 
glm(Result ~ Hours_Studied,
    family = "binomial",
    data = Students)

Call:  glm(formula = Result ~ Hours_Studied, family = "binomial", data = Students)

Coefficients:
  (Intercept)  Hours_Studied  
      0.01226        0.30213  

Degrees of Freedom: 6442 Total (i.e. Null);  6441 Residual
Null Deviance:      745.2 
Residual Deviance: 563.6    AIC: 567.6
ggplot(data = Students,
       mapping = aes(y = Result, x = Hours_Studied)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  ylim(0, 1) +
  geom_hline(yintercept = 0:1)
`geom_smooth()` using formula = 'y ~ x'

Using logistic regression, we modeled Result as a function of Hours_Studied and found that increased study hours significantly improved the likelihood of passing, with a positive coefficient of 0.30213.

In this binomial logistic regression analysis, Result is a binary outcome: 1 for students with an exam score of 60% or higher and 0 for those below. Because the dataset contains only students who scored at least 55%, there are relatively few failing observations, so the fitted logistic curve stays in the upper probability range across most of the observed study hours, which limits the model's predictive range at the low end.
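Plugging the fitted coefficients into the inverse-logit gives the predicted pass probability for a given number of study hours:

```r
# P(pass) = plogis(intercept + slope * hours), with the estimates above
p_pass <- function(hours) plogis(0.01226 + 0.30213 * hours)

p_pass(5)    # ~0.82 at 5 hours of study
p_pass(20)   # ~0.998 at 20 hours
```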

SUPPORT VECTOR MACHINES

To prepare our data for a Support Vector Machine (SVM) model, we categorized the Exam_Score variable into three groups—Low, Medium, and High. Using the case_when() function, scores less than or equal to 0.65 were labeled Low, scores between 0.65 and 0.70 were labeled Medium, and scores greater than 0.70 were labeled High. This categorization provides a structured target variable for multi-class classification.

Students <- Students %>%
        mutate(Exam_Score_Category = case_when(
                Exam_Score <= 0.65 ~ "Low",
                Exam_Score > 0.65 & Exam_Score <= 0.70 ~ "Medium",
                Exam_Score > 0.70 ~ "High"
        ))
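For reference, the same binning can be written with base R's cut() using identical cut points (a sketch; the analysis itself used case_when() as shown above):

```r
# Equivalent binning with cut(): intervals are (-Inf, 0.65], (0.65, 0.70],
# and (0.70, Inf), matching the case_when() thresholds.
score_to_category <- function(x) {
  cut(x,
      breaks = c(-Inf, 0.65, 0.70, Inf),
      labels = c("Low", "Medium", "High"))
}

score_to_category(c(0.60, 0.67, 0.80))  # Low, Medium, High
```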

To evaluate the classification performance of Support Vector Machines (SVM), we tested two different models: one with a linear kernel and another with a radial kernel. Both models aimed to predict the Exam_Score_Category (High, Medium, Low) based on features such as Hours_Studied and Attendance. The dataset was first split into training (90%) and testing (10%) subsets, and 10-fold cross-validation was used on the training data to validate the models and assess their performance.

## SVM Linear kernel with 10-CV validation 

set.seed(123)

# Split data into training and test sets
indices <- createDataPartition(Students$Exam_Score_Category, p = 0.9, list = FALSE)

train <- Students %>%
        slice(as.integer(indices))
train <- train %>%
        mutate(Exam_Score_Category = as.factor(Exam_Score_Category))

# Initialize a vector to store accuracies for 10-fold cross-validation
folds <- createFolds(train$Exam_Score_Category, k = 10)
cv_accuracies <- numeric(length(folds))

# Perform 10-fold cross-validation
for (i in seq_along(folds)) {
        # Create training and validation subsets for the current fold
        fold_val <- train[folds[[i]], ]
        fold_train <- train[-folds[[i]], ]
        
        # Train SVM model with linear kernel on the current training subset
        svm_fold <- svm(
                Exam_Score_Category ~ Hours_Studied + Attendance,
                data = fold_train,
                kernel = "linear",
                scale = TRUE,
                cost = 10
        )
        
        # Predict on the validation subset
        fold_val_pred <- predict(svm_fold, fold_val[, c("Hours_Studied", "Attendance")])
        
        # Calculate accuracy for the current fold
        fold_val_truth <- fold_val$Exam_Score_Category
        fold_val_acc <- sum(fold_val_pred == fold_val_truth) / nrow(fold_val)
        cv_accuracies[i] <- fold_val_acc
}

# Print fold-by-fold accuracies and mean accuracy
cat(sprintf("\nMean CV Accuracy (10-Fold): %.3f\n", mean(cv_accuracies)))

Mean CV Accuracy (10-Fold): 0.743
# Train the final SVM model on the full training set
final_svm <- svm(
        Exam_Score_Category ~ Hours_Studied + Attendance,
        data = train,
        kernel = "linear",
        scale = TRUE,
        cost = 10
)

# Prepare the test set
test <- Students %>%
        slice(-as.integer(indices))

# Predict on the test set
test_pred <- predict(final_svm, test[, c("Hours_Studied", "Attendance")])

# Create confusion matrix
conf_matrix <- confusionMatrix(as.factor(test_pred), as.factor(test$Exam_Score_Category))

# Print the confusion matrix
cat("\nConfusion Matrix on Test Data:\n")

Confusion Matrix on Test Data:
print(conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction High Low Medium
    High     55   0     18
    Low       2 143     41
    Medium   49  64    272

Overall Statistics
                                          
               Accuracy : 0.7298          
                 95% CI : (0.6937, 0.7638)
    No Information Rate : 0.514           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5352          
                                          
 Mcnemar's Test P-Value : 8.772e-05       

Statistics by Class:

                     Class: High Class: Low Class: Medium
Sensitivity               0.5189     0.6908        0.8218
Specificity               0.9665     0.9016        0.6390
Pos Pred Value            0.7534     0.7688        0.7065
Neg Pred Value            0.9107     0.8603        0.7722
Prevalence                0.1646     0.3214        0.5140
Detection Rate            0.0854     0.2220        0.4224
Detection Prevalence      0.1134     0.2888        0.5978
Balanced Accuracy         0.7427     0.7962        0.7304

The linear SVM was trained with a cost parameter of 10 and achieved a mean cross-validation accuracy of 74.3%. On the test set it reached an overall accuracy of 72.98%, as shown in the confusion matrix. By class, sensitivity was highest for "Medium" (0.82) and weakest for "High" (0.52), with many "High" and "Low" students misclassified as "Medium".
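The headline metrics can be recomputed directly from the counts in the confusion matrix above:

```r
# Rows = predicted class, columns = reference class (counts from the output)
cm <- matrix(c(55,   0,  18,
                2, 143,  41,
               49,  64, 272),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("High", "Low", "Medium"),
                             Reference  = c("High", "Low", "Medium")))

sum(diag(cm)) / sum(cm)   # overall accuracy ~0.7298
diag(cm) / colSums(cm)    # sensitivity: High ~0.52, Low ~0.69, Medium ~0.82
```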

An SVM model with a radial kernel was then trained to classify Exam_Score_Category (High, Medium, Low) from Hours_Studied and Attendance, again using a 90/10 train/test split and 10-fold cross-validation on the training data.

## Radial kernel with 10-CV Validation 
set.seed(123)

# Split data into training and test sets
indices <- createDataPartition(Students$Exam_Score_Category, p = 0.9, list = FALSE)
train <- Students[indices, ]
test <- Students[-indices, ]

# Ensure target variable is a factor in the training set
train$Exam_Score_Category <- as.factor(train$Exam_Score_Category)

# Initialize a vector to store accuracies for 10-fold cross-validation
folds <- createFolds(train$Exam_Score_Category, k = 10)
cv_accuracies <- numeric(length(folds))


# Perform 10-fold cross-validation
for (i in seq_along(folds)) {
        #cat(sprintf("\n--- Fold %d ---\n", i))
        
        # Create training and validation subsets for the current fold
        fold_val <- train[folds[[i]], ]
        fold_train <- train[-folds[[i]], ]
        
        # Train SVM model with radial kernel on the current training subset
        svm_fold <- svm(
                Exam_Score_Category ~ Hours_Studied + Attendance,
                data = fold_train,
                kernel = "radial",
                gamma = 5,
                cost = 10,
                scale = TRUE
        )
        
        # Suppress the plot for cross-validation folds
        # plot(svm_fold, train, Attendance ~ Hours_Studied)
        
        # Predict on the validation subset
        fold_val_pred <- predict(svm_fold, fold_val[, c("Hours_Studied", "Attendance")])
        
        # Calculate accuracy for the current fold
        fold_val_truth <- fold_val$Exam_Score_Category
        fold_val_acc <- sum(fold_val_pred == fold_val_truth) / nrow(fold_val)
        cv_accuracies[i] <- fold_val_acc
        
        #cat(sprintf("Accuracy for fold %d: %.3f\n", i, fold_val_acc))
}

# Print fold-by-fold accuracies and mean accuracy
#cat("\nFold-by-Fold Accuracies:", paste(round(cv_accuracies, 3), collapse = ", "))
cat(sprintf("\nMean CV Accuracy (10-Fold): %.3f\n", mean(cv_accuracies)))

Mean CV Accuracy (10-Fold): 0.742
# Train the final SVM model on the entire training set
svm_2 <- svm(
        Exam_Score_Category ~ Hours_Studied + Attendance,
        data = train,
        kernel = "radial",
        gamma = 5,
        cost = 10,
        scale = TRUE
)

# Make predictions on the test set
test_in <- test[, c("Hours_Studied", "Attendance")]
test_truth <- as.factor(test$Exam_Score_Category)
test_pred <- predict(svm_2, test_in)
#plot


```r
plot(svm_2, train, Attendance ~ Hours_Studied)
```


# Evaluate the final model's performance on the test set
conf_matrix <- confusionMatrix(as.factor(test_pred), test_truth)
cat("\nConfusion Matrix on Test Data:\n")

Confusion Matrix on Test Data:
print(conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction High Low Medium
    High     57   0     22
    Low       2 138     39
    Medium   47  69    270

Overall Statistics
                                          
               Accuracy : 0.722           
                 95% CI : (0.6857, 0.7563)
    No Information Rate : 0.514           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5228          
                                          
 Mcnemar's Test P-Value : 0.0002269       

Statistics by Class:

                     Class: High Class: Low Class: Medium
Sensitivity              0.53774     0.6667        0.8157
Specificity              0.95911     0.9062        0.6294
Pos Pred Value           0.72152     0.7709        0.6995
Neg Pred Value           0.91327     0.8516        0.7636
Prevalence               0.16460     0.3214        0.5140
Detection Rate           0.08851     0.2143        0.4193
Detection Prevalence     0.12267     0.2780        0.5994
Balanced Accuracy        0.74842     0.7864        0.7226

The final radial kernel SVM achieved an overall test accuracy of 72.2%, as shown in the confusion matrix, essentially on par with the linear kernel. The classification plot shows the non-linear decision boundaries the radial kernel creates, separating data points into their respective categories based on the two input features.

We then evaluated an SVM model with a radial kernel to classify Exam_Score_Category (High, Medium, Low) using Hours_Studied and Attendance as predictors. To address class imbalance, we applied SMOTE (Synthetic Minority Oversampling Technique) during 10-fold cross-validation, oversampling minority categories in the training data to improve classification accuracy.

## 10-FOLD CROSS VALIDATION (ADD-ON) FOR SVM WITH SMOTE
cat("### 10-Fold Cross Validation on 'train_data' (Oversampled per fold) ###\n")
### 10-Fold Cross Validation on 'train_data' (Oversampled per fold) ###
# 1) Create 10 folds from your existing 'train_data'
set.seed(123)  # for reproducibility
folds <- createFolds(train$Exam_Score_Category, k = 10)

# 2) Initialize a vector to store accuracies
cv_accuracies <- numeric(length(folds))

# 3) Loop through each fold
i <- 1
for (fold_idx in folds) {
        
        #cat("\n--- Fold", i, "---\n")
        
        # 'fold_idx' are the row indices for the validation portion
        fold_val <- train[ fold_idx, ]
        fold_train <- train[-fold_idx, ]
        
        # 3a) Perform SMOTE on 'fold_train' (like your original code does)
        smote_result_fold <- SMOTE(
                X = fold_train[, c("Hours_Studied", "Attendance")],
                target = fold_train$Exam_Score_Category,
                K = 5,
                dup_size = 2
        )
        oversampled_fold <- smote_result_fold$data
        
        # Convert 'class' to factor
        oversampled_fold$class <- as.factor(oversampled_fold$class)
        
        # 3b) Train SVM on oversampled data
        svm_fold <- ksvm(
                as.matrix(oversampled_fold[, c("Hours_Studied", "Attendance")]),
                oversampled_fold$class,
                type = "C-svc",
                kernel = "rbfdot",
                C = 200,
                scaled = c()
        )
        
        # 3c) Predict on 'fold_val'
        fold_val_pred <- predict(
                svm_fold, 
                newdata = as.matrix(fold_val[, c("Hours_Studied", "Attendance")])
        )
        
        #3d) Calculate accuracy on this fold
        fold_val_acc <- sum(fold_val_pred == fold_val$Exam_Score_Category) / nrow(fold_val)
        cv_accuracies[i] <- fold_val_acc
        i <- i + 1
}

# 4) Print out all fold accuracies and their mean
cat(sprintf("\nMean CV Accuracy (10-Fold): %.3f\n", mean(cv_accuracies)))

Mean CV Accuracy (10-Fold): 0.705
# Generate plots for visualization after cross-validation
#cat("\n### Generating Decision Boundary Plot ###\n")

# Generate grid for visualization
hours_grid <- seq(min(train$Hours_Studied), max(train$Hours_Studied), length.out = 200)
attendance_grid <- seq(min(train$Attendance), max(train$Attendance), length.out = 200)
grid_points <- expand.grid(Hours_Studied = hours_grid, Attendance = attendance_grid)

# Use the last trained SVM model for predictions (from the final fold)
grid_pred <- predict(svm_fold, newdata = as.matrix(grid_points), type = "response")
grid_points$Predicted_Class <- as.factor(grid_pred)
grid_points$Predicted_Class_Numeric <- as.numeric(grid_pred)

# Plot decision boundary and training data
plot <- ggplot() +
        geom_tile(data = grid_points, aes(x = Hours_Studied, y = Attendance, fill = Predicted_Class), alpha = 0.5) +
        geom_contour(data = grid_points, aes(x = Hours_Studied, y = Attendance, z = Predicted_Class_Numeric), 
                     color = "black", linewidth = 0.2, linetype = "solid") +
        geom_point(data = train, aes(x = Hours_Studied, y = Attendance, color = as.factor(Exam_Score_Category)), size = 2) +
        scale_fill_brewer(palette = "Set1", name = "Predicted Class") +
        scale_color_brewer(palette = "Set1", name = "True Class") +
        labs(
                title = "SVM Decision Boundary with rbfdot Kernel and Contour Lines",
                x = "Hours Studied",
                y = "Attendance"
        ) +
        theme_minimal()
print(plot)

The SVM model was trained on each fold, and its performance was assessed through cross-validation, achieving a mean accuracy of 70.5%. Additionally, we visualized the decision boundaries of the final model, which highlighted the model’s ability to classify data points into categories based on feature values.

The graph shows that high attendance (>90%) and over 30 study hours lead to high performance, while low attendance (<70%) and under 10 hours result in low performance. Medium performers fall in between, highlighting the need to balance attendance and study hours to improve outcomes for at-risk students.

test_pred <- predict(
  svm_fold,
  newdata = as.matrix(test[, c("Hours_Studied", "Attendance")])
)

conf_matrix <- confusionMatrix(
  as.factor(test_pred),
  as.factor(test$Exam_Score_Category)
)

# Print the confusion matrix
cat("\nConfusion Matrix on Test Data:\n")

Confusion Matrix on Test Data:
print(conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction High Low Medium
    High     86   1     79
    Low       2 133     38
    Medium   18  73    214

Overall Statistics
                                          
               Accuracy : 0.6724          
                 95% CI : (0.6346, 0.7085)
    No Information Rate : 0.514           
    P-Value [Acc > NIR] : 2.867e-16       
                                          
                  Kappa : 0.4781          
                                          
 Mcnemar's Test P-Value : 9.119e-11       

Statistics by Class:

                     Class: High Class: Low Class: Medium
Sensitivity               0.8113     0.6425        0.6465
Specificity               0.8513     0.9085        0.7093
Pos Pred Value            0.5181     0.7688        0.7016
Neg Pred Value            0.9582     0.8429        0.6549
Prevalence                0.1646     0.3214        0.5140
Detection Rate            0.1335     0.2065        0.3323
Detection Prevalence      0.2578     0.2686        0.4736
Balanced Accuracy         0.8313     0.7755        0.6779

The model was then tested on a separate test set, achieving an overall accuracy of 67.24%, as shown in the confusion matrix. The class-wise balanced accuracy was highest for the “High” category (83.13%) and slightly lower for the “Low” (77.55%) and “Medium” (67.79%) categories.

In general, the 10-fold cross-validation showed consistent performance across folds, indicating reliable generalization. The linear kernel produces simple, planar decision boundaries, while the radial kernel can also capture non-linear patterns. On this dataset, however, the two performed almost identically, with the linear kernel marginally ahead (cross-validation accuracy 74.3% vs. 74.2%; test accuracy 73.0% vs. 72.2%), suggesting that with only these two predictors the class boundaries are close to linear. The SMOTE variant traded overall accuracy (67.2%) for a substantially better balanced accuracy on the minority "High" class (83.1%).

NEURAL NETWORKS

We trained the neural network to classify Exam_Score_Category (High, Medium, Low) based on predictors such as Hours_Studied. The data was preprocessed by normalizing the features, one-hot encoding the target variable, and addressing class imbalance with random oversampling of the minority categories (caret's upSample()). The neural network was configured with two hidden layers (7 and 5 nodes) and a learning rate of 0.01. A 10-fold cross-validation approach was used to evaluate the model's performance, with accuracy and a confusion matrix calculated for each fold.

# Normalization -----
normalize <- function(x) {
        if (max(x) == min(x)) return(rep(0, length(x)))  
        (x - min(x)) / (max(x) - min(x))
}

one_hot_target <- function(df) {
        df %>%
                mutate(
                        High   = ifelse(Exam_Score_Category == "High", 1, 0),
                        Medium = ifelse(Exam_Score_Category == "Medium", 1, 0),
                        Low    = ifelse(Exam_Score_Category == "Low", 1, 0)
                ) %>%
                select(-Exam_Score_Category)  
}
```r
# Define the main training and evaluation function
do_training_and_eval <- function(train_raw_data, test_raw_data) {
        # Preprocess the train set
        train_prep_local <- train_raw_data %>%
                one_hot_target() %>%
                mutate(across(where(is.factor), as.numeric)) %>%
                mutate(across(where(is.numeric), normalize)) %>%
                mutate(
                        Category = factor(
                                case_when(
                                        High == 1   ~ "High",
                                        Medium == 1 ~ "Medium",
                                        Low == 1    ~ "Low"
                                ),
                                levels = c("High", "Medium", "Low")
                        )
                )

        # Oversample the training set (random duplication via caret::upSample)
        train_upsampled_local <- upSample(
                x    = train_prep_local %>% select(-High, -Medium, -Low, -Category),
                y    = train_prep_local$Category,
                list = FALSE
        ) %>%
                mutate(
                        High   = ifelse(Class == "High", 1, 0),
                        Medium = ifelse(Class == "Medium", 1, 0),
                        Low    = ifelse(Class == "Low", 1, 0)
                ) %>%
                select(-Class)

        # Preprocess the test set
        test_prep_local <- test_raw_data %>%
                one_hot_target() %>%
                mutate(across(where(is.factor), as.numeric)) %>%
                mutate(across(where(is.numeric), normalize))

        # Separate input and output for test
        test_in_local    <- test_prep_local %>% select(-High, -Medium, -Low)
        test_truth_local <- test_prep_local %>% select(High, Medium, Low)

        # Train the neural network
        set.seed(123)
        Students_net <- neuralnet(
                High + Medium + Low ~ .,
                data = train_upsampled_local,
                hidden = c(7, 5),  # 2 hidden layers (7 nodes, then 5 nodes)
                learningrate = 0.01,
                linear.output = FALSE
        )
        # Plot the neural network
        plot(Students_net, rep = "best")

        # Predict on the test fold
        predictions_local <- compute(Students_net, test_in_local)$net.result

        # Convert outputs to class labels
        predicted_classes_local <- apply(predictions_local, 1, function(row) {
                c("High", "Medium", "Low")[which.max(row)]
        })

        actual_classes_local <- apply(test_truth_local, 1, function(row) {
                c("High", "Medium", "Low")[which.max(row)]
        })

        # Compute confusion matrix & accuracy
        conf_mat_local <- table(Predicted = predicted_classes_local, Actual = actual_classes_local)
        fold_accuracy  <- sum(diag(conf_mat_local)) / sum(conf_mat_local)

        # Print confusion matrix and fold accuracy
        cat("Confusion Matrix for this fold:\n")
        print(conf_mat_local)
        cat(sprintf("Fold Accuracy: %.4f\n", fold_accuracy))

        return(fold_accuracy)
}
```

We then performed 10-fold cross-validation: the data are split into 10 folds, the model is trained on 9 of them and tested on the remaining fold, and this is repeated so that every fold serves once as the test set. The reported result is the average accuracy across the 10 folds.


```r
# Perform 10-Fold Cross-Validation 
set.seed(123)  
folds <- createFolds(Students$Exam_Score_Category, k = 10, list = TRUE)

cv_accuracies <- numeric(length(folds))

for (i in seq_along(folds)) {
        # Indices for this fold
        test_idx  <- folds[[i]]
        train_idx <- setdiff(seq_len(nrow(Students)), test_idx)
        
        # Build the training and testing splits
        fold_train_raw <- Students[train_idx, ]
        fold_test_raw  <- Students[test_idx, ]
        
        # Run the training and evaluation pipeline on this fold
        cv_accuracies[i] <- do_training_and_eval(
                train_raw_data = fold_train_raw,
                test_raw_data  = fold_test_raw
        )
        
        #cat(sprintf("Fold %d accuracy: %.4f\n", i, cv_accuracies[i]))
}
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High     51   0      0
   Low       0 207    131
   Medium   55   0    200
Fold Accuracy: 0.7112
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High     68   0      0
   Low       0 207    123
   Medium   38   0    209
Fold Accuracy: 0.7504
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High     37   0      0
   Low       0 207    221
   Medium   70   0    110
Fold Accuracy: 0.5488
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High     72   0      0
   Low       0 207    140
   Medium   34   0    191
Fold Accuracy: 0.7298
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High     77   0      0
   Low       0 207     78
   Medium   29   0    253
Fold Accuracy: 0.8339
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High    106  60    331
   Low       0  60      0
   Medium    0  87      0
Fold Accuracy: 0.2578
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High    106   0     67
   Low       0 108      0
   Medium    0  99    264
Fold Accuracy: 0.7422
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High     82   0      0
   Low       0 208     81
   Medium   24   0    250
Fold Accuracy: 0.8372
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High    106   0     56
   Low       0 207      0
   Medium    0   0    275
Fold Accuracy: 0.9130
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High     84   0      0
   Low       0 207     74
   Medium   22   0    257
Fold Accuracy: 0.8509
```

```r
# Compute average accuracy across all 10 folds
cv_mean_accuracy <- mean(cv_accuracies)
cat("10-Fold CV Average Accuracy:", cv_mean_accuracy, "\n")
```

```
10-Fold CV Average Accuracy: 0.7175254 
```

The neural network achieved an average cross-validation accuracy of 71.75%, comparable to the SVM models. Across folds, the confusion matrices show near-perfect classification of the "Low" category and reasonable performance on "Medium", but "High" students are frequently misclassified as "Medium". Fold accuracies also vary widely (from 0.2578 to 0.9130), suggesting sensitivity to the particular training split. Overall, the network captures complex, non-linear patterns and copes reasonably well with the imbalanced data, though further tuning could improve its ability to classify the underrepresented "High" category.
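The "High"-category weakness can be quantified with per-class recall (correct predictions divided by the actual column totals), computed here in base R from the first fold's printed confusion matrix:

```r
# Confusion matrix from the first fold (Predicted rows x Actual columns)
cm <- matrix(
        c(51,   0,  55,   # Actual = High
           0, 207,   0,   # Actual = Low
           0, 131, 200),  # Actual = Medium
        nrow = 3,
        dimnames = list(Predicted = c("High", "Low", "Medium"),
                        Actual    = c("High", "Low", "Medium"))
)

# Recall per class: diagonal over actual-class totals
recall <- diag(cm) / colSums(cm)
round(recall, 3)
#  High    Low Medium 
# 0.481  1.000  0.604 
```

Only about 48% of actual "High" students are recovered in this fold, compared with 100% of "Low", which matches the qualitative reading of the confusion matrices.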

NN MODEL WITH HIDDEN LAYERS OF 5 AND 4 NODES (LEARNING RATE EXCLUDED)

In this implementation, a neural network with two hidden layers (5 and 4 nodes) was trained to classify Exam_Score_Category (High, Medium, Low) from predictors such as Hours_Studied; the learningrate argument was omitted, so neuralnet falls back to its default. The data were preprocessed by normalizing numerical features and one-hot encoding the categorical target, and the training set was balanced by randomly oversampling the minority categories with caret's upSample(). As before, 10-fold cross-validation was used to evaluate the model, recording the fold accuracy and confusion matrix for each fold.
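The oversampling step relies on caret's upSample(), which duplicates rows of the smaller classes at random until every class matches the majority-class count. A minimal base-R sketch of the same idea (a hypothetical helper for illustration, not the caret implementation):

```r
# Random oversampling: resample each class up to the majority-class size
up_sample_sketch <- function(df, class_col) {
        counts  <- table(df[[class_col]])
        n_major <- max(counts)
        idx <- unlist(lapply(names(counts), function(lv) {
                rows <- which(df[[class_col]] == lv)
                if (length(rows) < n_major) {
                        # Duplicate randomly chosen rows of this minority class
                        extra <- rows[sample.int(length(rows),
                                                 n_major - length(rows),
                                                 replace = TRUE)]
                        c(rows, extra)
                } else rows
        }))
        df[idx, , drop = FALSE]
}

set.seed(123)
toy <- data.frame(x = 1:6,
                  Category = c("High", "Low", "Low", "Medium", "Medium", "Medium"))
balanced <- up_sample_sketch(toy, "Category")
table(balanced$Category)  # each class now has 3 rows
```

Note that duplication-based oversampling differs from SMOTE, which synthesizes new interpolated minority examples rather than copying existing rows.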

```r
# Normalization
normalize <- function(x) {
        if (max(x) == min(x)) return(rep(0, length(x)))
        (x - min(x)) / (max(x) - min(x))
}
```
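The min-max rule maps the smallest value of each numeric column to 0 and the largest to 1, with a guard for constant columns to avoid division by zero. A small worked example (the function is repeated so the snippet is self-contained):

```r
normalize <- function(x) {
        if (max(x) == min(x)) return(rep(0, length(x)))
        (x - min(x)) / (max(x) - min(x))
}

normalize(c(2, 4, 6))  # 0.0 0.5 1.0
normalize(c(7, 7, 7))  # 0 0 0  (constant-column guard)
```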

```r
one_hot_target <- function(df) {
        df %>%
                mutate(
                        High   = ifelse(Exam_Score_Category == "High", 1, 0),
                        Medium = ifelse(Exam_Score_Category == "Medium", 1, 0),
                        Low    = ifelse(Exam_Score_Category == "Low", 1, 0)
                ) %>%
                select(-Exam_Score_Category)
}

# Define the main training and evaluation function
do_training_and_eval <- function(train_raw_data, test_raw_data) {
        # Preprocess the train set
        train_prep_local <- train_raw_data %>%
                one_hot_target() %>%
                mutate(across(where(is.factor), as.numeric)) %>%
                mutate(across(where(is.numeric), normalize)) %>%
                mutate(
                        Category = factor(
                                case_when(
                                        High == 1   ~ "High",
                                        Medium == 1 ~ "Medium",
                                        Low == 1    ~ "Low"
                                ),
                                levels = c("High", "Medium", "Low")
                        )
                )

        # Oversample the training set (random duplication via caret::upSample)
        train_upsampled_local <- upSample(
                x    = train_prep_local %>% select(-High, -Medium, -Low, -Category),
                y    = train_prep_local$Category,
                list = FALSE
        ) %>%
                mutate(
                        High   = ifelse(Class == "High", 1, 0),
                        Medium = ifelse(Class == "Medium", 1, 0),
                        Low    = ifelse(Class == "Low", 1, 0)
                ) %>%
                select(-Class)

        # Preprocess the test set
        test_prep_local <- test_raw_data %>%
                one_hot_target() %>%
                mutate(across(where(is.factor), as.numeric)) %>%
                mutate(across(where(is.numeric), normalize))

        # Separate input and output for test
        test_in_local    <- test_prep_local %>% select(-High, -Medium, -Low)
        test_truth_local <- test_prep_local %>% select(High, Medium, Low)

        # Train the neural network (no explicit learning rate; neuralnet default)
        set.seed(123)
        Students_net <- neuralnet(
                High + Medium + Low ~ .,
                data = train_upsampled_local,
                hidden = c(5, 4),  # 2 hidden layers (5 nodes, then 4 nodes)
                linear.output = FALSE
        )
        # Plot the neural network
        plot(Students_net, rep = "best")

        # Predict on the test fold
        predictions_local <- compute(Students_net, test_in_local)$net.result

        # Convert outputs to class labels
        predicted_classes_local <- apply(predictions_local, 1, function(row) {
                c("High", "Medium", "Low")[which.max(row)]
        })

        actual_classes_local <- apply(test_truth_local, 1, function(row) {
                c("High", "Medium", "Low")[which.max(row)]
        })

        # Compute confusion matrix & accuracy
        conf_mat_local <- table(Predicted = predicted_classes_local, Actual = actual_classes_local)
        fold_accuracy  <- sum(diag(conf_mat_local)) / sum(conf_mat_local)

        # Print confusion matrix and fold accuracy
        cat("Confusion Matrix for this fold:\n")
        print(conf_mat_local)
        cat(sprintf("Fold Accuracy: %.4f\n", fold_accuracy))

        return(fold_accuracy)
}
```

```r
# Perform 10-Fold Cross-Validation 
set.seed(123)  
folds <- createFolds(Students$Exam_Score_Category, k = 10, list = TRUE)

cv_accuracies <- numeric(length(folds))

for (i in seq_along(folds)) {
        # Indices for this fold
        test_idx  <- folds[[i]]
        train_idx <- setdiff(seq_len(nrow(Students)), test_idx)
        
        # Build the training and testing splits
        fold_train_raw <- Students[train_idx, ]
        fold_test_raw  <- Students[test_idx, ]
        
        # Run the training and evaluation pipeline on this fold
        cv_accuracies[i] <- do_training_and_eval(
                train_raw_data = fold_train_raw,
                test_raw_data  = fold_test_raw
        )
        
        #cat(sprintf("Fold %d accuracy: %.4f\n", i, cv_accuracies[i]))
}
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High     45   0      0
   Low       0 207    131
   Medium   61   0    200
Fold Accuracy: 0.7019
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High     68   0      0
   Low       0 207    142
   Medium   38   0    190
Fold Accuracy: 0.7209
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High     39   0      0
   Low       0 207    221
   Medium   68   0    110
Fold Accuracy: 0.5519
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High     71   0      0
   Low       0 207    148
   Medium   35   0    183
Fold Accuracy: 0.7158
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High     82   0      0
   Low       0 207     78
   Medium   24   0    253
Fold Accuracy: 0.8416
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High    106  60    331
   Low       0  60      0
   Medium    0  87      0
Fold Accuracy: 0.2578
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High    106   0     55
   Low       0 101      0
   Medium    0 106    276
Fold Accuracy: 0.7500
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High     87   0      0
   Low       0 208     76
   Medium   19   0    255
Fold Accuracy: 0.8527
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High    106   0     57
   Low       0 207      0
   Medium    0   0    274
Fold Accuracy: 0.9115
```



```
Confusion Matrix for this fold:
         Actual
Predicted High Low Medium
   High     78   0      0
   Low       0 207     74
   Medium   28   0    257
Fold Accuracy: 0.8416
```

```r
# Compute average accuracy across all 10 folds
cv_mean_accuracy <- mean(cv_accuracies)
cat("10-Fold CV Average Accuracy:", cv_mean_accuracy, "\n")
```

```
10-Fold CV Average Accuracy: 0.7145768 
```

The neural network was trained without explicitly specifying a learning rate, and its performance was evaluated by comparing predicted class labels with the actual class labels. The average cross-validation accuracy across the 10 folds was 71.46%, and the confusion matrix for one fold indicates strong performance in classifying “Low” and “Medium” categories but weaker performance in classifying the “High” category, where more misclassifications occurred.

The neural network achieved an average cross-validation accuracy of 71.46%, which is slightly lower than the previous neural network model (71.75%) that used 7 and 5 nodes in its hidden layers and included a learning rate parameter. The confusion matrix highlights that while the model effectively distinguishes “Low” and “Medium” categories, it struggles with accurately predicting the “High” category, similar to the previous model.
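Because both configurations were evaluated on the same fold splits (the same seed precedes createFolds in both runs), their printed fold accuracies can be compared pairwise rather than only by their means. The vectors below are transcribed from the fold outputs above:

```r
acc_7_5 <- c(0.7112, 0.7504, 0.5488, 0.7298, 0.8339,
             0.2578, 0.7422, 0.8372, 0.9130, 0.8509)  # hidden = c(7, 5), lr = 0.01
acc_5_4 <- c(0.7019, 0.7209, 0.5519, 0.7158, 0.8416,
             0.2578, 0.7500, 0.8527, 0.9115, 0.8416)  # hidden = c(5, 4), default lr

round(mean(acc_7_5), 4)            # 0.7175
round(mean(acc_5_4), 4)            # 0.7146
round(mean(acc_7_5 - acc_5_4), 4)  # mean paired difference (small)
```

The per-fold differences are mostly a fraction of a percentage point, so the two architectures perform essentially equivalently on this data.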

CONCLUSIONS

This project evaluated a range of machine learning models, including linear regression, Generalized Linear Models (GLMs), Generalized Additive Models (GAMs), Support Vector Machines (SVMs) with linear and radial kernels, and neural networks, to predict student exam score categories (High, Medium, Low). The linear models, including GLMs and GAMs, captured straightforward relationships well, with GAMs slightly ahead thanks to their flexibility in modeling non-linear patterns. The linear SVM achieved the highest accuracy (72.98%) on the simpler relationships, while the radial SVM (72.2% accuracy) and the neural networks (71.75% and 71.46% cross-validation accuracy) were better suited to more complex, non-linear patterns. Random oversampling (caret's upSample) was used in the neural-network pipeline to address class imbalance, though all models struggled to predict the "High" category accurately. The most influential factors across all models were hours studied and attendance, which consistently drove better performance. Previous scores, motivation level, and parental involvement also contributed significantly, underscoring the combined importance of personal habits and external support systems.

Overall, all the approaches emphasized the importance of fostering good study habits, maintaining attendance, and creating supportive environments to enhance student performance. This project highlights the value of machine learning in understanding and predicting academic outcomes, providing actionable insights for educators and policymakers.